Assignment 1

Due Date

This assignment is due by midnight Pacific Time, September 27th, 2024.

Learning Goals

  • Download, read, and get familiar with an external dataset.
  • Step through the EDA “checklist” presented in class
  • Practice making exploratory plots

Assignment Description

We will work with air pollution data from the U.S. Environmental Protection Agency (EPA). The EPA has a national monitoring network of air pollution sites. The primary question you will answer is whether daily concentrations of PM\(_{2.5}\) (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years (from 2002 to 2022).

A primer on particulate matter air pollution can be found here.

Your assignment should be completed in Quarto or R Markdown.

Steps

  1. Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using the data.table package (for example, data.table::fread()). For each of the two datasets, check the dimensions, headers, footers, variable names, and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.

    Read Tables into R

    DT2002 <- data.table::fread("ad_viz_plotval_data.csv")
    DT2022 <- data.table::fread("ad_viz_plotval_data (1).csv")

    Check the dimensions, headers, footers, variable names and variable types for 2002

    dim(DT2002)
    [1] 15976    22
    head(DT2002)
             Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
           <char> <char>    <int> <int>                          <num>   <char>
    1: 01/05/2002    AQS 60010007     1                           25.1 ug/m3 LC
    2: 01/06/2002    AQS 60010007     1                           31.6 ug/m3 LC
    3: 01/08/2002    AQS 60010007     1                           21.4 ug/m3 LC
    4: 01/11/2002    AQS 60010007     1                           25.9 ug/m3 LC
    5: 01/14/2002    AQS 60010007     1                           34.5 ug/m3 LC
    6: 01/17/2002    AQS 60010007     1                           41.0 ug/m3 LC
       Daily AQI Value Local Site Name Daily Obs Count Percent Complete
                 <int>          <char>           <int>            <num>
    1:              81       Livermore               1              100
    2:              93       Livermore               1              100
    3:              74       Livermore               1              100
    4:              82       Livermore               1              100
    5:              98       Livermore               1              100
    6:             115       Livermore               1              100
       AQS Parameter Code AQS Parameter Description Method Code
                    <int>                    <char>       <int>
    1:              88101  PM2.5 - Local Conditions         120
    2:              88101  PM2.5 - Local Conditions         120
    3:              88101  PM2.5 - Local Conditions         120
    4:              88101  PM2.5 - Local Conditions         120
    5:              88101  PM2.5 - Local Conditions         120
    6:              88101  PM2.5 - Local Conditions         120
                          Method Description CBSA Code
                                      <char>     <int>
    1: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
    2: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
    3: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
    4: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
    5: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
    6: Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
                               CBSA Name State FIPS Code      State
                                  <char>           <int>     <char>
    1: San Francisco-Oakland-Hayward, CA               6 California
    2: San Francisco-Oakland-Hayward, CA               6 California
    3: San Francisco-Oakland-Hayward, CA               6 California
    4: San Francisco-Oakland-Hayward, CA               6 California
    5: San Francisco-Oakland-Hayward, CA               6 California
    6: San Francisco-Oakland-Hayward, CA               6 California
       County FIPS Code  County Site Latitude Site Longitude
                  <int>  <char>         <num>          <num>
    1:                1 Alameda      37.68753      -121.7842
    2:                1 Alameda      37.68753      -121.7842
    3:                1 Alameda      37.68753      -121.7842
    4:                1 Alameda      37.68753      -121.7842
    5:                1 Alameda      37.68753      -121.7842
    6:                1 Alameda      37.68753      -121.7842
    tail(DT2002)
             Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
           <char> <char>    <int> <int>                          <num>   <char>
    1: 12/10/2002    AQS 61131003     1                             15 ug/m3 LC
    2: 12/13/2002    AQS 61131003     1                             15 ug/m3 LC
    3: 12/22/2002    AQS 61131003     1                              1 ug/m3 LC
    4: 12/25/2002    AQS 61131003     1                             23 ug/m3 LC
    5: 12/28/2002    AQS 61131003     1                              5 ug/m3 LC
    6: 12/31/2002    AQS 61131003     1                              6 ug/m3 LC
       Daily AQI Value      Local Site Name Daily Obs Count Percent Complete
                 <int>               <char>           <int>            <num>
    1:              62 Woodland-Gibson Road               1              100
    2:              62 Woodland-Gibson Road               1              100
    3:               6 Woodland-Gibson Road               1              100
    4:              77 Woodland-Gibson Road               1              100
    5:              28 Woodland-Gibson Road               1              100
    6:              33 Woodland-Gibson Road               1              100
       AQS Parameter Code AQS Parameter Description Method Code
                    <int>                    <char>       <int>
    1:              88101  PM2.5 - Local Conditions         117
    2:              88101  PM2.5 - Local Conditions         117
    3:              88101  PM2.5 - Local Conditions         117
    4:              88101  PM2.5 - Local Conditions         117
    5:              88101  PM2.5 - Local Conditions         117
    6:              88101  PM2.5 - Local Conditions         117
                          Method Description CBSA Code
                                      <char>     <int>
    1: R & P Model 2000 PM2.5 Sampler w/WINS     40900
    2: R & P Model 2000 PM2.5 Sampler w/WINS     40900
    3: R & P Model 2000 PM2.5 Sampler w/WINS     40900
    4: R & P Model 2000 PM2.5 Sampler w/WINS     40900
    5: R & P Model 2000 PM2.5 Sampler w/WINS     40900
    6: R & P Model 2000 PM2.5 Sampler w/WINS     40900
                                     CBSA Name State FIPS Code      State
                                        <char>           <int>     <char>
    1: Sacramento--Roseville--Arden-Arcade, CA               6 California
    2: Sacramento--Roseville--Arden-Arcade, CA               6 California
    3: Sacramento--Roseville--Arden-Arcade, CA               6 California
    4: Sacramento--Roseville--Arden-Arcade, CA               6 California
    5: Sacramento--Roseville--Arden-Arcade, CA               6 California
    6: Sacramento--Roseville--Arden-Arcade, CA               6 California
       County FIPS Code County Site Latitude Site Longitude
                  <int> <char>         <num>          <num>
    1:              113   Yolo      38.66121      -121.7327
    2:              113   Yolo      38.66121      -121.7327
    3:              113   Yolo      38.66121      -121.7327
    4:              113   Yolo      38.66121      -121.7327
    5:              113   Yolo      38.66121      -121.7327
    6:              113   Yolo      38.66121      -121.7327
    str(DT2002)
    Classes 'data.table' and 'data.frame':  15976 obs. of  22 variables:
     $ Date                          : chr  "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
     $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
     $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
     $ POC                           : int  1 1 1 1 1 1 1 1 1 1 ...
     $ Daily Mean PM2.5 Concentration: num  25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
     $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
     $ Daily AQI Value               : int  81 93 74 82 98 115 89 62 69 107 ...
     $ Local Site Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
     $ Daily Obs Count               : int  1 1 1 1 1 1 1 1 1 1 ...
     $ Percent Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
     $ AQS Parameter Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
     $ AQS Parameter Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
     $ Method Code                   : int  120 120 120 120 120 120 120 120 120 120 ...
     $ Method Description            : chr  "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" ...
     $ CBSA Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
     $ CBSA Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
     $ State FIPS Code               : int  6 6 6 6 6 6 6 6 6 6 ...
     $ State                         : chr  "California" "California" "California" "California" ...
     $ County FIPS Code              : int  1 1 1 1 1 1 1 1 1 1 ...
     $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
     $ Site Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
     $ Site Longitude                : num  -122 -122 -122 -122 -122 ...
     - attr(*, ".internal.selfref")=<externalptr> 

There are 15976 rows and 22 columns for the 2002 data set. The header and footer are properly loaded with no apparent missing data.

Variable names are Date, Source, Site ID, POC, Daily Mean PM2.5 Concentration, Units, Daily AQI Value, Local Site Name, Daily Obs Count, Percent Complete, AQS Parameter Code, AQS Parameter Description, Method Code, Method Description, CBSA Code, CBSA Name, State FIPS Code, State, County FIPS Code, County, Site Latitude, and Site Longitude.

Character variables: Date, Source, Units, Local Site Name, AQS Parameter Description, Method Description, CBSA Name, State, and County.

Numeric variables (integer or double): Site ID, POC, Daily Mean PM2.5 Concentration, Daily AQI Value, Daily Obs Count, Percent Complete, AQS Parameter Code, Method Code, CBSA Code, State FIPS Code, County FIPS Code, Site Latitude, and Site Longitude.

Check the dimensions, headers, footers, variable names and variable types for 2022

::: {.cell}

```{.r .cell-code}
dim(DT2022)
```

::: {.cell-output .cell-output-stdout}

```
[1] 59756    22
```


:::

```{.r .cell-code}
head(DT2022)
```

::: {.cell-output .cell-output-stdout}

```
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 01/01/2022    AQS 60010007     3                           12.7 ug/m3 LC
2: 01/02/2022    AQS 60010007     3                           13.9 ug/m3 LC
3: 01/03/2022    AQS 60010007     3                            7.1 ug/m3 LC
4: 01/04/2022    AQS 60010007     3                            3.7 ug/m3 LC
5: 01/05/2022    AQS 60010007     3                            4.2 ug/m3 LC
6: 01/06/2022    AQS 60010007     3                            3.8 ug/m3 LC
   Daily AQI Value Local Site Name Daily Obs Count Percent Complete
             <int>          <char>           <int>            <num>
1:              58       Livermore               1              100
2:              60       Livermore               1              100
3:              39       Livermore               1              100
4:              21       Livermore               1              100
5:              23       Livermore               1              100
6:              21       Livermore               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         170
2:              88101  PM2.5 - Local Conditions         170
3:              88101  PM2.5 - Local Conditions         170
4:              88101  PM2.5 - Local Conditions         170
5:              88101  PM2.5 - Local Conditions         170
6:              88101  PM2.5 - Local Conditions         170
                     Method Description CBSA Code
                                 <char>     <int>
1: Met One BAM-1020 Mass Monitor w/VSCC     41860
2: Met One BAM-1020 Mass Monitor w/VSCC     41860
3: Met One BAM-1020 Mass Monitor w/VSCC     41860
4: Met One BAM-1020 Mass Monitor w/VSCC     41860
5: Met One BAM-1020 Mass Monitor w/VSCC     41860
6: Met One BAM-1020 Mass Monitor w/VSCC     41860
                           CBSA Name State FIPS Code      State
                              <char>           <int>     <char>
1: San Francisco-Oakland-Hayward, CA               6 California
2: San Francisco-Oakland-Hayward, CA               6 California
3: San Francisco-Oakland-Hayward, CA               6 California
4: San Francisco-Oakland-Hayward, CA               6 California
5: San Francisco-Oakland-Hayward, CA               6 California
6: San Francisco-Oakland-Hayward, CA               6 California
   County FIPS Code  County Site Latitude Site Longitude
              <int>  <char>         <num>          <num>
1:                1 Alameda      37.68753      -121.7842
2:                1 Alameda      37.68753      -121.7842
3:                1 Alameda      37.68753      -121.7842
4:                1 Alameda      37.68753      -121.7842
5:                1 Alameda      37.68753      -121.7842
6:                1 Alameda      37.68753      -121.7842
```


:::

```{.r .cell-code}
tail(DT2022)
```

::: {.cell-output .cell-output-stdout}

```
         Date Source  Site ID   POC Daily Mean PM2.5 Concentration    Units
       <char> <char>    <int> <int>                          <num>   <char>
1: 12/01/2022    AQS 61131003     1                            3.4 ug/m3 LC
2: 12/07/2022    AQS 61131003     1                            3.8 ug/m3 LC
3: 12/13/2022    AQS 61131003     1                            6.0 ug/m3 LC
4: 12/19/2022    AQS 61131003     1                           34.8 ug/m3 LC
5: 12/25/2022    AQS 61131003     1                           23.2 ug/m3 LC
6: 12/31/2022    AQS 61131003     1                            1.0 ug/m3 LC
   Daily AQI Value      Local Site Name Daily Obs Count Percent Complete
             <int>               <char>           <int>            <num>
1:              19 Woodland-Gibson Road               1              100
2:              21 Woodland-Gibson Road               1              100
3:              33 Woodland-Gibson Road               1              100
4:              99 Woodland-Gibson Road               1              100
5:              77 Woodland-Gibson Road               1              100
6:               6 Woodland-Gibson Road               1              100
   AQS Parameter Code AQS Parameter Description Method Code
                <int>                    <char>       <int>
1:              88101  PM2.5 - Local Conditions         145
2:              88101  PM2.5 - Local Conditions         145
3:              88101  PM2.5 - Local Conditions         145
4:              88101  PM2.5 - Local Conditions         145
5:              88101  PM2.5 - Local Conditions         145
6:              88101  PM2.5 - Local Conditions         145
                                      Method Description CBSA Code
                                                  <char>     <int>
1: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
2: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
3: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
4: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
5: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
6: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
                                 CBSA Name State FIPS Code      State
                                    <char>           <int>     <char>
1: Sacramento--Roseville--Arden-Arcade, CA               6 California
2: Sacramento--Roseville--Arden-Arcade, CA               6 California
3: Sacramento--Roseville--Arden-Arcade, CA               6 California
4: Sacramento--Roseville--Arden-Arcade, CA               6 California
5: Sacramento--Roseville--Arden-Arcade, CA               6 California
6: Sacramento--Roseville--Arden-Arcade, CA               6 California
   County FIPS Code County Site Latitude Site Longitude
              <int> <char>         <num>          <num>
1:              113   Yolo      38.66121      -121.7327
2:              113   Yolo      38.66121      -121.7327
3:              113   Yolo      38.66121      -121.7327
4:              113   Yolo      38.66121      -121.7327
5:              113   Yolo      38.66121      -121.7327
6:              113   Yolo      38.66121      -121.7327
```


:::

```{.r .cell-code}
str(DT2022)
```

::: {.cell-output .cell-output-stdout}

```
Classes 'data.table' and 'data.frame':  59756 obs. of  22 variables:
 $ Date                          : chr  "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Daily Mean PM2.5 Concentration: num  12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily AQI Value               : int  58 60 39 21 23 21 13 38 59 55 ...
 $ Local Site Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily Obs Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS Parameter Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS Parameter Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method Code                   : int  170 170 170 170 170 170 170 170 170 170 ...
 $ Method Description            : chr  "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" ...
 $ CBSA Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State FIPS Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County FIPS Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site Longitude                : num  -122 -122 -122 -122 -122 ...
 - attr(*, ".internal.selfref")=<externalptr> 
```


:::
:::

There are 59756 rows and 22 columns for the 2022 data set. The header and footer are properly loaded with no apparent missing data. All variable names and types are the same as in the 2002 data set.
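
The claim that both years share the same names and types can also be checked programmatically instead of by eye. The sketch below is a minimal illustration in base R; `toy_2002` and `toy_2022` are hypothetical two-column stand-ins for `DT2002` and `DT2022`, not the real downloads.

```r
# Hypothetical stand-ins for DT2002 and DT2022 (made-up rows, two columns only)
toy_2002 <- data.frame(Date = c("01/05/2002", "01/06/2002"),
                       `Daily Mean PM2.5 Concentration` = c(25.1, 31.6),
                       check.names = FALSE, stringsAsFactors = FALSE)
toy_2022 <- data.frame(Date = c("01/01/2022", "01/02/2022"),
                       `Daily Mean PM2.5 Concentration` = c(12.7, 13.9),
                       check.names = FALSE, stringsAsFactors = FALSE)

# Same column names, in the same order?
same_names <- identical(names(toy_2002), names(toy_2022))

# Same column classes? Take the first class of each column.
types_2002 <- vapply(toy_2002, function(x) class(x)[1], character(1))
types_2022 <- vapply(toy_2022, function(x) class(x)[1], character(1))
same_types <- identical(types_2002, types_2022)

print(c(same_names = same_names, same_types = same_types))
```

If either check were FALSE, `rbind()` on the two tables would need attention before combining them.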

  2. Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.

    library(dplyr)
    
    Attaching package: 'dplyr'
    The following objects are masked from 'package:stats':
    
        filter, lag
    The following objects are masked from 'package:base':
    
        intersect, setdiff, setequal, union
    # Combine the tables
    DT <- rbind(DT2002, DT2022)
    
    # Create a new column for Year
    DT$Date <- as.Date(DT$Date, format = "%m/%d/%Y")
    DT$Year <- format(DT$Date, "%Y")
    
    # Change the names of key variables
    DT <- DT |>
      rename("PM2.5" = "Daily Mean PM2.5 Concentration", "lat" = "Site Latitude", "lon" = "Site Longitude")
    DT$Year <- as.numeric(as.character(DT$Year))
    
    # Double check variables
    head(DT)
             Date Source  Site ID   POC PM2.5    Units Daily AQI Value
           <Date> <char>    <int> <int> <num>   <char>           <int>
    1: 2002-01-05    AQS 60010007     1  25.1 ug/m3 LC              81
    2: 2002-01-06    AQS 60010007     1  31.6 ug/m3 LC              93
    3: 2002-01-08    AQS 60010007     1  21.4 ug/m3 LC              74
    4: 2002-01-11    AQS 60010007     1  25.9 ug/m3 LC              82
    5: 2002-01-14    AQS 60010007     1  34.5 ug/m3 LC              98
    6: 2002-01-17    AQS 60010007     1  41.0 ug/m3 LC             115
       Local Site Name Daily Obs Count Percent Complete AQS Parameter Code
                <char>           <int>            <num>              <int>
    1:       Livermore               1              100              88101
    2:       Livermore               1              100              88101
    3:       Livermore               1              100              88101
    4:       Livermore               1              100              88101
    5:       Livermore               1              100              88101
    6:       Livermore               1              100              88101
       AQS Parameter Description Method Code                    Method Description
                          <char>       <int>                                <char>
    1:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
    2:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
    3:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
    4:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
    5:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
    6:  PM2.5 - Local Conditions         120 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS
       CBSA Code                         CBSA Name State FIPS Code      State
           <int>                            <char>           <int>     <char>
    1:     41860 San Francisco-Oakland-Hayward, CA               6 California
    2:     41860 San Francisco-Oakland-Hayward, CA               6 California
    3:     41860 San Francisco-Oakland-Hayward, CA               6 California
    4:     41860 San Francisco-Oakland-Hayward, CA               6 California
    5:     41860 San Francisco-Oakland-Hayward, CA               6 California
    6:     41860 San Francisco-Oakland-Hayward, CA               6 California
       County FIPS Code  County      lat       lon  Year
                  <int>  <char>    <num>     <num> <num>
    1:                1 Alameda 37.68753 -121.7842  2002
    2:                1 Alameda 37.68753 -121.7842  2002
    3:                1 Alameda 37.68753 -121.7842  2002
    4:                1 Alameda 37.68753 -121.7842  2002
    5:                1 Alameda 37.68753 -121.7842  2002
    6:                1 Alameda 37.68753 -121.7842  2002
    tail(DT)
             Date Source  Site ID   POC PM2.5    Units Daily AQI Value
           <Date> <char>    <int> <int> <num>   <char>           <int>
    1: 2022-12-01    AQS 61131003     1   3.4 ug/m3 LC              19
    2: 2022-12-07    AQS 61131003     1   3.8 ug/m3 LC              21
    3: 2022-12-13    AQS 61131003     1   6.0 ug/m3 LC              33
    4: 2022-12-19    AQS 61131003     1  34.8 ug/m3 LC              99
    5: 2022-12-25    AQS 61131003     1  23.2 ug/m3 LC              77
    6: 2022-12-31    AQS 61131003     1   1.0 ug/m3 LC               6
            Local Site Name Daily Obs Count Percent Complete AQS Parameter Code
                     <char>           <int>            <num>              <int>
    1: Woodland-Gibson Road               1              100              88101
    2: Woodland-Gibson Road               1              100              88101
    3: Woodland-Gibson Road               1              100              88101
    4: Woodland-Gibson Road               1              100              88101
    5: Woodland-Gibson Road               1              100              88101
    6: Woodland-Gibson Road               1              100              88101
       AQS Parameter Description Method Code
                          <char>       <int>
    1:  PM2.5 - Local Conditions         145
    2:  PM2.5 - Local Conditions         145
    3:  PM2.5 - Local Conditions         145
    4:  PM2.5 - Local Conditions         145
    5:  PM2.5 - Local Conditions         145
    6:  PM2.5 - Local Conditions         145
                                          Method Description CBSA Code
                                                      <char>     <int>
    1: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
    2: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
    3: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
    4: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
    5: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
    6: R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
                                     CBSA Name State FIPS Code      State
                                        <char>           <int>     <char>
    1: Sacramento--Roseville--Arden-Arcade, CA               6 California
    2: Sacramento--Roseville--Arden-Arcade, CA               6 California
    3: Sacramento--Roseville--Arden-Arcade, CA               6 California
    4: Sacramento--Roseville--Arden-Arcade, CA               6 California
    5: Sacramento--Roseville--Arden-Arcade, CA               6 California
    6: Sacramento--Roseville--Arden-Arcade, CA               6 California
       County FIPS Code County      lat       lon  Year
                  <int> <char>    <num>     <num> <num>
    1:              113   Yolo 38.66121 -121.7327  2022
    2:              113   Yolo 38.66121 -121.7327  2022
    3:              113   Yolo 38.66121 -121.7327  2022
    4:              113   Yolo 38.66121 -121.7327  2022
    5:              113   Yolo 38.66121 -121.7327  2022
    6:              113   Yolo 38.66121 -121.7327  2022
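
The date parsing and renaming in this step can be reduced to a few lines and guarded against silent parse failures. This is a base R sketch on hypothetical toy rows; the column name matches the EPA export, but the values are made up.

```r
# Hypothetical rows in the raw EPA date format (month/day/year as text)
toy <- data.frame(Date = c("01/05/2002", "12/31/2022"),
                  `Daily Mean PM2.5 Concentration` = c(25.1, 1.0),
                  check.names = FALSE, stringsAsFactors = FALSE)

# Parse the text dates once, then extract the year as an integer
toy$Date <- as.Date(toy$Date, format = "%m/%d/%Y")
toy$Year <- as.integer(format(toy$Date, "%Y"))

# Rename by matching the old name rather than by column position
names(toy)[names(toy) == "Daily Mean PM2.5 Concentration"] <- "PM2.5"

# A date that failed to parse would appear as NA, so stop if any exist
stopifnot(!anyNA(toy$Date))
str(toy)
```

Converting the year directly with `as.integer()` avoids the intermediate character column and the later `as.numeric(as.character(...))` step.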
  3. Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.

    library("leaflet")
    color_palette <- colorNumeric(palette = "viridis", domain = DT$Year)
    leaflet(DT) |>
      addProviderTiles('OpenStreetMap') |>
      addCircles(lat=~lat,
                 lng=~lon, 
                 opacity=1, 
                 fillOpacity=1, 
                 radius=100, 
                 color=~color_palette(Year),
                 fillColor=~color_palette(Year))

Monitoring sites are denser where population density is higher: cities and the coastline have many more sites than the mountain ranges. In particular, sites cluster around the Bay Area, Los Angeles, and San Diego. This seems logical because we most want to know air pollution levels where people live and where they are more likely to pollute the air.
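
A numeric companion to the map is a count of distinct monitoring sites per year, since the map draws one circle per observation rather than per site. The sketch below uses base R on hypothetical site IDs (`site_id` is an assumed short name for the Site ID column).

```r
# Hypothetical observations: several rows per site, two years
toy <- data.frame(
  site_id = c(1, 1, 2, 1, 3, 4),
  Year    = c(2002, 2002, 2002, 2022, 2022, 2022)
)

# Keep one row per site-year combination
sites <- unique(toy[, c("site_id", "Year")])

# Number of distinct sites in each year
n_sites <- table(sites$Year)

# Sites that report in 2022 but not in 2002
sites_2002 <- sites$site_id[sites$Year == 2002]
sites_2022 <- sites$site_id[sites$Year == 2022]
new_in_2022 <- setdiff(sites_2022, sites_2002)

print(n_sites)
print(new_in_2022)
```

Plotting the deduplicated `sites` table instead of every daily row would also make the map lighter to render.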

  4. Check for any missing or implausible values of PM\(_{2.5}\) in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.

    sum(is.na(DT$PM2.5))
    [1] 0
    # There are no missing values of PM2.5.
    
    summary(DT$PM2.5)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      -6.70    4.50    7.60   10.05   12.20  302.50 
    # Set a maximum plausible value of 500 ug/m^3 (as given by the EPA in 2012) and a minimum of 0:
    max_PM <- 500
    min_PM <- 0
    impossible <- DT$PM2.5[DT$PM2.5 < min_PM | DT$PM2.5 > max_PM]
    sum_impossible <- length(impossible)
    print(sum_impossible)
    [1] 215
    # The maximum (302.5) is below 500, so all 215 impossible values are negative; set them to 0
    DT_new <- DT
    DT_new$PM2.5 <- ifelse(DT_new$PM2.5 < 0, 0, DT_new$PM2.5)
    
    # Find the proportion of impossible data
    prop <- sum_impossible/length(DT$PM2.5)*100
    print(prop)
    [1] 0.2838958
    # Only 0.284% of the data is impossible
    
    # Temporal summary
    temporal <- DT[DT$PM2.5 < 0, .(Count = .N), by = Date]
    temporal <- temporal[order(-Count)]
    print(temporal)
               Date Count
             <Date> <int>
      1: 2022-12-31    11
      2: 2022-07-06     8
      3: 2022-12-30     8
      4: 2022-09-19     7
      5: 2022-12-11     6
     ---                 
    114: 2022-02-08     1
    115: 2022-02-10     1
    116: 2022-01-15     1
    117: 2022-02-02     1
    118: 2022-04-23     1
    # All negative (implausible) values are from 2022
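
To describe temporal patterns beyond a list of dates, the share of negative readings can be summarized by year and by month. This is a base R sketch on hypothetical values, not the EPA data.

```r
# Hypothetical daily readings, some of them negative
toy <- data.frame(
  Date  = as.Date(c("2002-01-05", "2002-07-10", "2022-07-06",
                    "2022-12-30", "2022-12-31", "2022-12-31")),
  PM2.5 = c(25.1, 12.0, -1.2, -0.5, 3.0, -2.0)
)
toy$Year     <- format(toy$Date, "%Y")
toy$Month    <- format(toy$Date, "%m")
toy$negative <- toy$PM2.5 < 0

# Proportion of negative readings within each year
prop_by_year <- tapply(toy$negative, toy$Year, mean)

# Number of negative readings in each year-month
neg_by_month <- aggregate(negative ~ Year + Month, data = toy, FUN = sum)

print(prop_by_year)
print(neg_by_month)
```

Grouping by month rather than by single date makes seasonal clusters easier to see than a sorted list of days.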
  5. Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.

    • state

      DT_new$year <- as.factor(DT_new$Year)
      library(ggplot2)
      # Create a histogram
      ggplot(data = DT_new) + 
        geom_histogram(aes(x = PM2.5, fill = year)) +
        labs(title = "PM2.5 by Year in California", x = "PM2.5")
      `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

      # Create a box plot
      ggplot(data = DT_new) + 
        geom_boxplot(aes(x = year, y = PM2.5, fill = year)) +
        labs(title = "PM2.5 by Year in California", x = "Year", y = "PM2.5")

      # Statistical summary of state data
      State <- DT_new[, .(Mean = mean(PM2.5), Min = min(PM2.5), Max = max(PM2.5), IQR = IQR(PM2.5)), by = Year]
      print(State)
          Year      Mean   Min   Max   IQR
         <num>     <num> <num> <num> <num>
      1:  2002 16.115943     0 104.3  13.5
      2:  2022  8.431138     0 302.5   6.6

      Overall, the state's average PM2.5 decreased from 16.1 ug/m^3 in 2002 to 8.4 ug/m^3 in 2022, and the interquartile range narrowed from 13.5 to 6.6, meaning daily values varied less in 2022 than in 2002. The 2022 data contain some much higher outlying values (maximum 302.5 vs. 104.3 in 2002), which slightly raise the 2022 mean. Therefore, some counties may have had worse pollution in 2022 than in 2002, but the state as a whole has lower air pollution levels.
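
Because a few extreme days can pull the mean upward, the median is a useful second summary at the state level. The base R sketch below uses hypothetical numbers chosen so that one extreme 2022 value raises the 2022 mean above the 2002 mean while the median still falls.

```r
# Hypothetical readings: one extreme value in 2022
toy <- data.frame(
  Year  = c(2002, 2002, 2002, 2022, 2022, 2022),
  PM2.5 = c(10, 16, 22, 5, 8, 300)
)

# Mean and median per year
means   <- tapply(toy$PM2.5, toy$Year, mean)
medians <- tapply(toy$PM2.5, toy$Year, median)

# Rows are statistics, columns are years
print(rbind(mean = means, median = medians))
```

In the real data the mean already falls between the two years, so the median would serve as a robustness check rather than a correction.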

    • county

      # Create a bar graph
      ggplot(data = DT_new, aes(x = County, y = PM2.5, fill = year)) +
        geom_bar(stat = "identity", position = "dodge") +
        labs(title = "PM2.5 Trends by County", x = "County", y= "PM2.5") +
        coord_flip()

      # Create a heat map
      ggplot(DT_new, aes(x = year, y = County, fill = PM2.5)) +
        geom_tile() +
        scale_fill_gradient(low = "white", high = "red") +
        labs(title = "PM2.5 2002 and 2022 by County", x = "Year", y = "County")

      # Statistical summary of county data
      County_stats <- DT_new |>
        group_by(year, County) |>
        summarize(Mean = mean(PM2.5), Max = max(PM2.5), Min = min(PM2.5), .groups = "drop") |>
        arrange(County)
      print(County_stats)
      # A tibble: 98 × 5
         year  County        Mean   Max   Min
         <fct> <chr>        <dbl> <dbl> <dbl>
       1 2002  Alameda      14.3   61.6   1.9
       2 2022  Alameda       8.20  35.5   0  
       3 2002  Butte        14.8   88     1  
       4 2022  Butte         6.19  42.8   0  
       5 2002  Calaveras     9.9   40     2  
       6 2022  Calaveras     6.04  25.9   0  
       7 2002  Colusa       11.7   57     1  
       8 2022  Colusa        7.61  37     0.6
       9 2002  Contra Costa 15.1   76.7   2  
      10 2022  Contra Costa  8.25  37.3   0.9
      # ℹ 88 more rows

      Many counties with PM2.5 measurements in both 2002 and 2022 had lower average PM2.5 in 2022, as seen in the table and charts. The counties with very high 2022 values (around 300 ug/m^3), which appear as outliers at the state level, still have relatively low means, so a few isolated days in a few counties account for the extreme state-level values. Overall, most California counties decreased their average PM2.5, which lowers air pollution levels for California as a whole.
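
The county comparison can be made explicit by placing the two yearly means side by side and computing the change. This is a base R sketch on hypothetical values; the county names are real, but the numbers are made up.

```r
# Hypothetical readings for two counties in both years
toy <- data.frame(
  County = c("Alameda", "Alameda", "Butte", "Butte",
             "Alameda", "Alameda", "Butte", "Butte"),
  Year   = c(2002, 2002, 2002, 2002, 2022, 2022, 2022, 2022),
  PM2.5  = c(14, 16, 13, 17, 8, 9, 6, 7),
  stringsAsFactors = FALSE
)

# Matrix of means: one row per county, one column per year
county_means <- tapply(toy$PM2.5, list(toy$County, toy$Year), mean)

# Change in mean from 2002 to 2022 (negative means a decrease)
change <- county_means[, "2022"] - county_means[, "2002"]

# Share of counties whose mean decreased, ignoring counties missing a year
share_decreased <- mean(change < 0, na.rm = TRUE)

print(county_means)
print(change)
```

A county measured in only one year gets `NA` in the other column, which is why `na.rm = TRUE` is used when computing the share.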

    • site in Los Angeles

      # Create a new table for LA data
      LA <- DT_new |>
        filter(County == "Los Angeles")
      LA_2002 <- LA |> filter(year == "2002")
      LA_2022 <- LA |> filter(year == "2022")
      
      # Create a line plot for LA PM2.5 in 2002 and 2022
      library(gridExtra)
      
      Attaching package: 'gridExtra'
      The following object is masked from 'package:dplyr':
      
          combine
      plot_2002 <- ggplot(LA_2002, aes(x = Date, y = PM2.5)) +
        geom_line(color = "blue") +
        geom_point(color = "blue") +
        labs(title = "Change in PM2.5 in Los Angeles in 2002", x = "Date in 2002", y = "PM2.5")
      plot_2022 <- ggplot(LA_2022, aes(x = Date, y = PM2.5)) +
        geom_line(color = "red") +
        geom_point(color = "red") +
        labs(title = "Change in PM2.5 in Los Angeles in 2022", x = "Date in 2022", y = "PM2.5")
      grid.arrange(plot_2002, plot_2022, ncol = 2)

      # Create a box plot for 2002 and 2022
      ggplot(LA, aes(x = year, y = PM2.5)) +
        geom_boxplot(fill = "lightblue", color = "black") +
        labs(title = "PM2.5 Levels in Los Angeles in 2002 vs 2022", x = "Year", y = "PM2.5")

    # Statistical summary of LA data
    LA_stats <- LA[, .(Mean = mean(PM2.5), Min = min(PM2.5), Max = max(PM2.5), IQR = IQR(PM2.5)), by = year]
    print(LA_stats)
         year     Mean   Min   Max   IQR
       <fctr>    <num> <num> <num> <num>
    1:   2002 19.65604   0.6  72.4  14.4
    2:   2022 10.97258   0.0  56.0   6.3

    Los Angeles County had substantially lower PM2.5 levels in 2022 than in 2002: the mean fell from about 20 ug/m^3 to about 11 ug/m^3, and variation narrowed, with the IQR dropping from 14.4 to 6.3. Unlike in some other counties, even the high outliers decreased, with the maximum falling from 72.4 in 2002 to 56.0 in 2022. Los Angeles County therefore shows lower air pollution in 2022 than in 2002, which is impressive for such a large and populous region.
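
The summary above pools every monitor in Los Angeles County. One way to look at individual sites is to restrict the comparison to sites that reported in both years and then average per site. The base R sketch below uses hypothetical site names and values; `site` stands in for the Local Site Name column.

```r
# Hypothetical Los Angeles readings: only "Site A" reports in both years
toy_la <- data.frame(
  site  = c("Site A", "Site A", "Site C", "Site A", "Site A", "Site B"),
  Year  = c(2002, 2002, 2002, 2022, 2022, 2022),
  PM2.5 = c(20, 24, 18, 10, 12, 9),
  stringsAsFactors = FALSE
)

# Sites present in both years
both <- intersect(toy_la$site[toy_la$Year == 2002],
                  toy_la$site[toy_la$Year == 2022])

# Keep only those sites, then average per site and year
paired     <- toy_la[toy_la$site %in% both, ]
site_means <- aggregate(PM2.5 ~ site + Year, data = paired, FUN = mean)

print(site_means)
```

Restricting to shared sites keeps the comparison like-for-like, since the 2022 data set has far more rows than the 2002 one.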


This homework has been adapted from the case study in Roger Peng’s Exploratory Data Analysis with R